AITopics | splice site

Collaborating Authors

splice site

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

DeepVRegulome: DNABERT-based deep-learning framework for predicting the functional impact of short genomic variants on the human regulome

Dutta, Pratik, Obusan, Matthew, Sathian, Rekha, Chao, Max, Surana, Pallavi, Papineni, Nimisha, Ji, Yanrong, Zhou, Zhihan, Liu, Han, Yurovsky, Alisa, Davuluri, Ramana V

arXiv.org Artificial IntelligenceNov-13-2025

Whole-genome sequencing (WGS) has revealed numerous non-coding short variants whose functional impacts remain poorly understood. Despite recent advances in deep-learning genomic approaches, accurately predicting and prioritizing clinically relevant mutations in gene regulatory regions remains a major challenge. Here we introduce Deep VRegulome, a deep-learning method for prediction and interpretation of functionally disruptive variants in the human regulome, which combines 700 DNABERT fine-tuned models, trained on vast amounts of ENCODE gene regulatory regions, with variant scoring, motif analysis, attention-based visualization, and survival analysis. We showcase its application on TCGA glioblastoma WGS dataset in prioritizing survival-associated mutations and regulatory regions. The analysis identified 572 splice-disrupting and 9,837 transcription-factor binding site altering mutations occurring in greater than 10% of glioblastoma samples. Survival analysis linked 1352 mutations and 563 disrupted regulatory regions to patient outcomes, enabling stratification via non-coding mutation signatures. All the code, fine-tuned models, and an interactive data portal are publicly available.

artificial intelligence, machine learning, variant, (20 more...)

arXiv.org Artificial Intelligence

2511.09026

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.94)

Industry: Health & Medicine > Therapeutic Area > Oncology > Brain Cancer (0.58)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CircFormerMoE: An End-to-End Deep Learning Framework for Circular RNA Splice Site Detection and Pairing in Plant Genomes

Jiang, Tianyou

arXiv.org Artificial IntelligenceJul-14-2025

Circular RNAs (circRNAs) are important components of the non-coding RNA regulatory network. Previous circRNA identification primarily relies on high-throughput RNA sequencing (RNA-seq) data combined with alignment-based algorithms that detect back-splicing signals. However, these methods face several limitations: they can't predict circRNAs directly from genomic DNA sequences and relies heavily on RNA experimental data; they involve high computational costs due to complex alignment and filtering steps; and they are inefficient for large-scale or genome-wide circRNA prediction. The challenge is even greater in plants, where plant circRNA splice sites often lack the canonical GT-AG motif seen in human mRNA splicing, and no efficient deep learning model with strong generalization capability currently exists. Furthermore, the number of currently identified plant circRNAs is likely far lower than their true abundance. In this paper, we propose a deep learning framework named CircFormerMoE based on transformers and mixture-of experts for predicting circRNAs directly from plant genomic DNA. Our framework consists of two subtasks known as splicing site detection (SSD) and splicing site pairing (SSP). The model's effectiveness has been validated on gene data of 10 plant species. Trained on known circRNA instances, it is also capable of discovering previously unannotated circRNAs. In addition, we performed interpretability analyses on the trained model to investigate the sequence patterns contributing to its predictions. Our framework provides a fast and accurate computational method and tool for large-scale circRNA discovery in plants, laying a foundation for future research in plant functional genomics and non-coding RNA annotation.

artificial intelligence, machine learning, sequence, (17 more...)

arXiv.org Artificial Intelligence

2507.08542

Genre: Research Report (0.64)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Horizon-wise Learning Paradigm Promotes Gene Splicing Identification

Li, Qi-Jie, Sun, Qian, Zhang, Shao-Qun

arXiv.org Artificial IntelligenceJun-15-2024

Identifying gene splicing is a core and significant task confronted in modern collaboration between artificial intelligence and bioinformatics. Past decades have witnessed great efforts on this concern, such as the bio-plausible splicing pattern AT-CG and the famous SpliceAI. In this paper, we propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI). The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor. In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation, improving accuracy and efficiency. The experiments conducted on the real-world Human dataset show that our proposed H-GSI outperforms SpliceAI and achieves the best accuracy of 97.20\%. The source code is available from this link.

prediction, sequence, splice site, (15 more...)

arXiv.org Artificial Intelligence

2406.119

Country: Asia > China > Jiangsu Province > Nanjing (0.05)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Identifying DNA Sequence Motifs Using Deep Learning

Poddar, Asmita, Uzun, Vladimir, Tunbridge, Elizabeth, Haerty, Wilfried, Nevado-Holgado, Alejo

arXiv.org Artificial IntelligenceNov-20-2023

Advancements in genomic technologies [22] have enabled the generation of vast amounts of DNA sequence data, enabling the structural and functional study of the human genome [3]. This has been valuable in providing insights into the genetic basis of diseases. Since splice sites play a crucial role in this, accurately predicting these sites in DNA sequences is essential for diagnosing diseases. The internal structure of DNA sequences consists of alternating protein-coding regions that contain information for the production of proteins called exons, and non-coding regions that interrupt the coding sequence called introns, as shown in Figure 1. Mutations in the intronic region have been associated with developmental disorders such as isolated Pierre Robin sequence [28] and several types of cancer [31]. Hence, the accurate identification of boundaries between the exons and introns, known as splice sites, has special biological significance with healthcare implications, such as to study the sources of unresolved genetic disease. Current annotations of the human genome to identify where exons/introns are located in the sequence [10] are far from complete and much remains unknown about the DNA sequence motifs that indicate the presence of a splice site. We aim to address this challenge computationally by predicting the presence of splice sites -- both the exon start (intron end) sites, known as acceptor sites, and the exon end (intron start) sites, known as donor sites -- within the DNA sequence.

dna sequence, sequence, splice site, (15 more...)

arXiv.org Artificial Intelligence

2311.12884

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.54)
Health & Medicine > Therapeutic Area > Genetic Disease (0.34)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Transcriptomic signatures across human tissues identify functional rare genetic variation

ScienceSep-10-2020, 17:41:02 GMT

Every human genome contains tens of thousands of rare genetic variants—which include single nucleotide changes, insertions or deletions, and larger structural variants—and some may have a functional effect. Ferraro et al. examined data from individuals in the Genotype-Tissue Expression (GTEx) project for outliers across tissues caused by gene expression, splicing, and allele-specific expression. Single rare variants were observed that affected the expression and allele-specific expression of multiple genes and, in the case of a gene fusion event, splicing. Experimental and computational validation suggest that many individuals carry more than 50 rare variants that affect transcription in some way. Although most variants were predicted to not affect an individual's phenotype, a small percentage showed likely disease-related associations, emphasizing the importance of studying the impact of rare genetic variation on the transcriptome. Science , this issue p. [eaaz5900][1] ### INTRODUCTION The human genome contains tens of thousands of rare (minor allele frequency <1%) variants, some of which contribute to disease risk. Using 838 samples with whole-genome and multitissue transcriptome sequencing data in the Genotype-Tissue Expression (GTEx) project version 8, we assessed how rare genetic variants contribute to extreme patterns in gene expression (eOutliers), allelic expression (aseOutliers), and alternative splicing (sOutliers). We integrated these three signals across 49 tissues with genomic annotations to prioritize high-impact rare variants (RVs) that associate with human traits. ### RATIONALE Outlier gene expression aids in identifying functional RVs. Transcriptome sequencing provides diverse measurements beyond gene expression, including allele-specific expression and alternative splicing, which can provide additional insight into RV functional effects. ### RESULTS After identifying multitissue eOutliers, aseOutliers, and sOutliers, we found that outlier individuals of each type were significantly more likely to carry an RV near the corresponding gene. Among eOutliers, we observed strong enrichment of rare structural variants. sOutliers were particularly enriched for RVs that disrupted or created a splicing consensus sequence. aseOutliers provided the strongest enrichment signal when evaluated from just a single tissue. We developed Watershed, a probabilistic model for personal genome interpretation that improves over standard genomic annotation–based methods for scoring RVs by integrating these three transcriptomic signals from the same individual and replicates in an independent cohort. To assess whether outlier RVs identified in GTEx associate with traits, we evaluated these variants for association with diverse traits in the UK Biobank, the Million Veterans Program, and the Jackson Heart Study. We found that transcriptome-assisted prioritization identified RVs with larger trait effect sizes and were better predictors of effect size than genomic annotation alone. ### CONCLUSION With >800 genomes matched with transcriptomes across 49 tissues, we were able to study RVs that underlie extreme changes in the transcriptome. To capture the diversity of these extreme changes, we developed and integrated approaches to identify expression, allele-specific expression, and alternative splicing outliers, and characterized the RV landscape underlying each outlier signal. We demonstrate that personal genome interpretation and RV discovery is enhanced by using these signals. This approach provides a new means to integrate a richer set of functional RVs into models of genetic burden, improve disease gene identification, and enable the delivery of precision genomics. ![Figure][2] Transcriptomic signatures identify functional rare genetic variation. We identified genes in individuals that show outlier expression, allele-specific expression, or alternative splicing and assessed enrichment of nearby rare variation. We integrated these three outlier signals with genomic annotation data to prioritize functional RVs and to intersect those variants with disease loci to identify potential RV trait associations. Rare genetic variants are abundant across the human genome, and identifying their function and phenotypic impact is a major challenge. Measuring aberrant gene expression has aided in identifying functional, large-effect rare variants (RVs). Here, we expanded detection of genetically driven transcriptome abnormalities by analyzing gene expression, allele-specific expression, and alternative splicing from multitissue RNA-sequencing data, and demonstrate that each signal informs unique classes of RVs. We developed Watershed, a probabilistic model that integrates multiple genomic and transcriptomic signals to predict variant function, validated these predictions in additional cohorts and through experimental assays, and used them to assess RVs in the UK Biobank, the Million Veterans Program, and the Jackson Heart Study. Our results link thousands of RVs to diverse molecular effects and provide evidence to associate RVs affecting the transcriptome with human traits. [1]: /lookup/doi/10.1126/science.aaz5900 [2]: pending:yes

bioinformatics, machine learning, variant, (20 more...)

Science

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
(37 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)
Health & Medicine > Therapeutic Area > Oncology (0.68)

Technology:

Information Technology > Biomedical Informatics (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)

Add feedback

Developing parsimonious ensembles using ensemble diversity within a reinforcement learning framework

Stanescu, Ana, Pandey, Gaurav

arXiv.org Machine LearningMay-5-2018

Heterogeneous ensembles built from the predictions of a wide variety and large number of diverse base predictors represent a potent approach to building predictive models for problems where the ideal base/individual predictor may not be obvious. Ensemble selection is an especially promising approach here, not only for improving prediction performance, but also because of its ability to select a collectively predictive subset, often a relatively small one, of the base predictors. In this paper, we present a set of algorithms that explicitly incorporate ensemble diversity, a known factor influencing predictive performance of ensembles, into a reinforcement learning framework for ensemble selection. We rigorously tested these approaches on several challenging problems and associated data sets, yielding that several of them produced more accurate ensembles than those that don't explicitly consider diversity. More importantly, these diversity-incorporating ensembles were much smaller in size, i.e., more parsimonious, than the latter types of ensembles. This can eventually aid the interpretation or reverse engineering of predictive models assimilated into the resultant ensemble(s).

data mining, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

1805.02103

Country: North America > United States (0.68)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

mTim: Rapid and accurate transcript reconstruction from RNA-Seq data

Zeller, Georg, Goernitz, Nico, Kahles, Andre, Behr, Jonas, Mudrakarta, Pramod, Sonnenburg, Soeren, Raetsch, Gunnar

arXiv.org Machine LearningSep-20-2013

High-throughput sequencing technology applied to cellular mRNA (RNA-Seq) has revolutionized transcriptome studies [19, 17, 35, among many others]. In contrast to microarray platforms, which it has replaced in many applications, RNA-Seq can not only be used to accurately quantify known transcripts, but also to reveal the precise structure of transcripts at single-nucleotide resolution. RNA-Seq based transcript reconstruction has therefore become a valuable tool for the completion of genome annotations [22, 11, for instance] and further enabled subsequent analyses of differentially expressed genes [2], transcript isoforms [6, 4] and exons [3], all of which generally rely on correctly inferred transcript inventories. De novo transcript reconstruction is thus a pivotal step in the analysis of RNA-Seq data. There are two conceptually different strategies to approach this problem: one can either assemble transcripts directly from RNA-Seq reads using methodology that originated from genome assembly approaches [13, 23, 25].

artificial intelligence, bioinformatics, machine learning, (18 more...)

arXiv.org Machine Learning

1309.5211

Country: Europe > Germany (0.47)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Biomedical Informatics > Translational Bioinformatics (0.90)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.30)

Add feedback

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

Schweikert, Gabriele, Rätsch, Gunnar, Widmer, Christian, Schölkopf, Bernhard

Neural Information Processing SystemsDec-31-2009

We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation methods can help improve classification performance.

artificial intelligence, machine learning, organism, (16 more...)

Neural Information Processing Systems

Country: Europe > Germany (0.16)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback